A note on the Bayesian regret of Thompson Sampling with an arbitrary prior
Authors
Sébastien Bubeck and Che-Yu Liu
Abstract
We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We show that, for any prior distribution, the Thompson Sampling strategy achieves a Bayesian regret bounded from above by $14\sqrt{nK}$. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by $\frac{1}{20}\sqrt{nK}$.

In this paper we are interested in the Bayesian multi-armed bandit problem, which can be described as follows. Let $\pi_0$ be a known distribution over some set $\Theta$, and let $\theta$ be a random variable distributed according to $\pi_0$. For $i \in [K]$, let $(X_{i,s})_{s \ge 1}$ be identically distributed random variables taking values in $[0,1]$ which are independent conditionally on $\theta$. Denote $\mu_i(\theta) := \mathbb{E}(X_{i,1} \mid \theta)$. Consider now an agent facing $K$ actions (or arms). At each time step $t = 1, \ldots, n$, the agent pulls an arm $I_t \in [K]$. The agent receives the reward $X_{i,s}$ when it pulls arm $i$ for the $s$-th time. The arm selection is based only on past observed rewards and potentially on an external source of randomness. More formally, let $(U_s)_{s \ge 1}$ be an i.i.d. sequence of random variables uniformly distributed on $[0,1]$, and let $T_i(s) = \sum_{t=1}^{s} \mathbb{1}_{\{I_t = i\}}$; then $I_t$ is a random variable measurable with respect to $\sigma(I_1, X_{I_1,1}, \ldots, I_{t-1}, X_{I_{t-1}, T_{I_{t-1}}(t-1)}, U_t)$. We measure the performance of the agent through the Bayesian regret, defined as
$$\mathrm{BR}_n = \mathbb{E} \sum_{t=1}^{n} \Big( \max_{i \in [K]} \mu_i(\theta) - \mu_{I_t}(\theta) \Big),$$
where the expectation is taken over $\theta$, the rewards, and the external randomization.
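To make the setting concrete, below is a minimal simulation sketch (not taken from the paper) of Thompson Sampling on a Bernoulli bandit with an independent uniform, i.e. Beta(1,1), prior on each arm's mean, estimating the Bayesian regret by averaging over draws of $\theta$ from the prior. The function names `thompson_sampling` and `estimate_bayesian_regret` and the Beta-Bernoulli model are illustrative assumptions; the paper's bound holds for an arbitrary prior.

```python
import numpy as np


def thompson_sampling(means, n, rng):
    """Thompson Sampling with a Beta(1,1) prior on each Bernoulli arm.

    Returns the realized pseudo-regret sum_t (max_i mu_i - mu_{I_t})
    for one draw of theta (the vector of arm means).
    """
    K = len(means)
    successes = np.zeros(K)  # posterior is Beta(successes + 1, failures + 1)
    failures = np.zeros(K)
    best = np.max(means)
    regret = 0.0
    for _ in range(n):
        # Sample a mean for each arm from its posterior and pull the argmax.
        samples = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(samples))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        regret += best - means[arm]
    return regret


def estimate_bayesian_regret(n=1000, K=5, trials=200, seed=0):
    """Estimate the Bayesian regret by averaging over theta ~ prior."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        means = rng.uniform(0.0, 1.0, size=K)  # theta drawn from the prior
        total += thompson_sampling(means, n, rng)
    return total / trials


if __name__ == "__main__":
    n, K = 1000, 5
    print(f"estimated Bayesian regret: {estimate_bayesian_regret(n, K):.1f}")
    print(f"14 * sqrt(n K) upper bound: {14 * np.sqrt(n * K):.1f}")
```

For $n = 1000$ and $K = 5$ the estimate should sit well below the prior-independent upper bound $14\sqrt{nK} \approx 990$; the bound is worst-case over priors, so this particular Beta-Bernoulli instance does not come close to it.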
Similar papers
A Note on Information-Directed Sampling and Thompson Sampling
This note introduces three Bayesian-style multi-armed bandit algorithms: Information-Directed Sampling, Thompson Sampling, and Generalized Thompson Sampling. The goal is to give an intuitive explanation of these three algorithms and their regret bounds, and to provide some derivations that are omitted in the original papers.
Thompson Sampling for Online Learning with Linear Experts
In this note, we present a version of the Thompson sampling algorithm for the problem of online linear generalization with full information (i.e., the experts setting), studied by Kalai and Vempala, 2005. The algorithm uses a Gaussian prior and time-varying Gaussian likelihoods, and we show that it essentially reduces to Kalai and Vempala's Follow-the-Perturbed-Leader strategy, with exponentiall...
Complex Bandit Problems and Thompson Sampling
We study stochastic multi-armed bandit settings with complex actions derived from the basic bandit arms, e.g., subsets or partitions of basic arms. The decision maker is faced with selecting at each round a complex action instead of a basic arm. We allow the reward of the complex action to be some function of the basic arms’ rewards, and so the feedback observed may not necessarily be the rewar...
Information Directed Sampling and Bandits with Heteroscedastic Noise
In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy function evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and satisfies a tail bound taken uniformly on the domain. In this work, we consider the setting of heteroscedastic noise, that is, we explicitly allow the noise distribution to depend on ...
Nonparametric General Reinforcement Learning
Reinforcement learning problems are often phrased in terms of Markov decision processes (MDPs). In this thesis we go beyond MDPs and consider reinforcement learning in environments that are non-Markovian, non-ergodic and only partially observable. Our focus is not on practical algorithms, but rather on the fundamental underlying problems: How do we balance exploration and exploitation? How do w...
Journal: CoRR
Volume: abs/1304.5758
Pages: -
Publication date: 2013